The aim of this mini-project is to implement and compare the Variational Bayes and Metropolis-Hastings methods. We consider the Gaussian mixture example : given a dataset, the goal is to estimate the mixture distribution. For the sake of simplicity, we consider in this project a synthetic dataset that we generate ourselves. This allows us to assess the correctness of our work, by comparison with the ground truth.
In this first section, we generate a bi-variate dataset \(X_{1:N}\) assumed to be \(i.i.d.\) according to the density \[ f(x|\rho_{1:k}, \mu_{1:k}, \Sigma_{1:k}) = \sum_{j=1}^k \rho_j\mathcal{N}(x|\mu_j,\Sigma_j) \] where \[ \left\{ \begin{array}{ll} N = 500 \\ k = 3\\ \rho_{1:3} = \begin{bmatrix} \frac{4}{10} & \frac{3}{10} & \frac{3}{10} \end{bmatrix}\\ \mu_{1:3} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}\\ \nu = 4\\ S = \begin{bmatrix} 0.02 & 0 \\ 0 & 0.02 \end{bmatrix}\\ \forall i \in \{1,...,k\}\ \ \Sigma_i \sim \mathcal{W}(\nu,S)\\ \end{array} \right. \] The model can be rewritten using hidden variables \(\xi_{1:N}\) with \(\xi_i \in \{1,...k\}\) : \[ \begin{aligned} \xi_i &\sim \textrm{Multinomial}(\rho)\\ \mathcal{L}&(X_i|\xi_i=j) = \mathcal{N}(\mu_j,\Sigma_j)\ \ j\in \{1,...,k\} \end{aligned} \] It is mathematically convenient to represent the multinomial variable \(\xi_i\) by a binary vector \(z_i\) of size \(k\). The parameter for this Gaussian mixture is thus \(\theta = (\rho_{1:k}, \mu_{1:k}, \Sigma_{1:k})\), which we shall represent in \(\texttt{R}\) as three objects : \(\texttt{p}\), \(\texttt{Mu}\), \(\texttt{Sigma}\), with \(\texttt{p}\) a vector of size \(k\) (standing for \(\rho\)), \(\texttt{Mu}\) a \(k\) × \(d\) matrix and \(\texttt{Sigma}\) a \(d\) × \(d\) × \(k\) array.
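As a sketch, the dataset above could be generated along the following lines, using base R’s \(\texttt{rWishart}\) and \(\texttt{MASS::mvrnorm}\); the objects \(\texttt{p}\), \(\texttt{Mu}\), \(\texttt{Sigma}\) follow the conventions just stated, while everything else (seed included) is our own choice, not necessarily the project’s :

```r
# Hypothetical generation of the synthetic dataset (names beyond p, Mu, Sigma are ours)
set.seed(1)
N <- 500; k <- 3; d <- 2
p  <- c(4, 3, 3) / 10                              # mixture weights rho
Mu <- rbind(c(0, 0), c(1, 0), c(0, 1))             # k x d matrix of means
nu <- 4; S <- 0.02 * diag(d)
Sigma <- stats::rWishart(k, df = nu, Sigma = S)    # d x d x k array of covariances
xi <- sample(1:k, N, replace = TRUE, prob = p)     # hidden component labels
X  <- t(sapply(xi, function(j) MASS::mvrnorm(1, Mu[j, ], Sigma[, , j])))
```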
Let’s plot the generated data using one color for each mixture component. Then, on the same plot, we add the center of each Gaussian mixture component, along with an ellipsoid around each mean corresponding to the density level set of the univariate \(0.95\) quantile.
To estimate the mixture distribution from the dataset, we start with the Variational Bayes approach. We use a mean-field variational distribution : \(q(z,\rho_{1:k},\mu_{1:k},\Lambda_{1:k}) = q(z)q(\rho_{1:k},\mu_{1:k},\Lambda_{1:k})\). It is the only assumption we need to make in order to obtain a tractable practical solution to our Bayesian mixture model, which will be determined by optimization of the variational distribution.
We choose a prior as follows:
\(\rho_{1:k} \sim \textrm{Dir}(\alpha)\) where \(\alpha\) has size \(k\) and \(\alpha=(\alpha_0,...,\alpha_0)^T\) with \(\alpha_0 > 0\);
\(\forall j \in \{1,...,k\},\ \ (\mu_{j},\Lambda_{j}) \sim \mathcal{N} (\mu_j|m_0,(\beta_0\Lambda_j)^{-1})\mathcal{W}(\Lambda_j|W_0,\nu_0)\ \ \textrm{where}\ \Lambda_{j} = (\Sigma_{j})^{-1}\);
\(\rho_{1:k}\) and the pairs \((\mu_j,\Lambda_j)\) are independent.
Following Bishop (2006), we know that the logarithm of the optimal factor is given by : \[ \begin{aligned} \ln q^{\ast}(z) = \sum_{n=1}^N\sum_{j=1}^k z_{nj}\ln \gamma_{nj} + \textrm{const} \end{aligned} \] where : \[ \begin{aligned} &\ln \gamma_{nj} = \mathbb{E}[\ln \rho_{j}] + \frac{1}{2}\mathbb{E}[\ln |\Lambda_{j}|] - \frac{d}{2}\ln (2\pi) -\frac{1}{2} \mathbb{E}_{\mu_{j},\Lambda_{j}}[(x_n-\mu_j)^T\Lambda_j(x_n-\mu_j)]\\ &d = 2 \ \ (X \textrm{ is a bi-variate dataset}) \end{aligned} \]
We obtain : \[ q^{\ast}(z) \propto \prod_{n=1}^N\prod_{j=1}^k r_{nj}^{z_{nj}} \] where the quantities \(r_{nj} = \frac{\gamma_{nj}}{\sum_{l=1}^k \gamma_{nl}}\) play the role of responsibilities.
At this point, it is convenient to define three statistics of the observed dataset evaluated with respect to the responsibilities :
\(\forall j \in \{1,...,k\},\) \[ \begin{aligned} N_j &= \sum_{n=1}^Nr_{nj}\\ \bar{x}_j &= \frac{1}{N_j}\sum_{n=1}^Nr_{nj}x_n\\ S_j &= \frac{1}{N_j}\sum_{n=1}^Nr_{nj}(x_n-\bar{x}_j)(x_n-\bar{x}_j)^T\\ \end{aligned} \]
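The three statistics above can be computed in a few lines; this is a minimal sketch assuming the data are an \(N\) × \(d\) matrix and the responsibilities an \(N\) × \(k\) matrix (the function name is ours, not the project’s) :

```r
# Hedged sketch: N_j, xbar_j and S_j from data X (N x d) and responsibilities r (N x k)
mixtureStats <- function(X, r) {
  k <- ncol(r); d <- ncol(X)
  Nj   <- colSums(r)                             # effective counts N_j
  xbar <- sweep(t(r) %*% X, 1, Nj, "/")          # k x d matrix of weighted means
  Sj   <- array(0, dim = c(d, d, k))
  for (j in 1:k) {
    centered <- sweep(X, 2, xbar[j, ])           # rows x_n - xbar_j
    Sj[, , j] <- t(centered * r[, j]) %*% centered / Nj[j]
  }
  list(Nj = Nj, xbar = xbar, Sj = Sj)
}
```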
Following Bishop (2006), we have : \[ \begin{aligned} q(\rho_{1:k},\mu_{1:k},\Lambda_{1:k}) = q(\rho_{1:k}) \prod_{j=1}^k q(\mu_{j},\Lambda_{j}) \end{aligned} \] and : \[ \ln q^{\ast}(\rho_{1:k}) = (\alpha_0 - 1) \sum_{j=1}^k \ln \rho_j + \sum_{j=1}^k \sum_{n=1}^N r_{nj}\ln \rho_j+ \textrm{const} \] We recognize \(q^{\ast}(\rho_{1:k})\) as a Dirichlet distribution : \[ \begin{aligned} &q^{\ast}(\rho_{1:k}) = \textrm{Dir}(\rho_{1:k}| \alpha_{1:k})\\ &\alpha_j = \alpha_0 + N_j\ \ \ \forall j \in \{1,...,k\} \end{aligned} \] Finally, the variational posterior distribution \(q^{*}(\mu_{1:k},\Lambda_{1:k}) = q^{*} (\mu_{1:k} | \Lambda_{1:k}) q^{*}(\Lambda_{1:k})\) is a Gaussian-Wishart distribution and is given by : \[ q^{\ast}(\mu_{j},\Lambda_{j}) = \mathcal{N} (\mu_j|m_j,(\beta_j\Lambda_j)^{-1})\mathcal{W}(\Lambda_j|W_j,\nu_j) \] where \[ \begin{aligned} \beta_j &= \beta_0 + N_j\\ m_j &= \frac{1}{\beta_j}(\beta_0m_0+N_j\bar{x}_j)\\ W_j^{-1} &= W_0^{-1} + N_jS_j + \frac{\beta_0N_j}{\beta_0+N_j}(\bar{x}_j-m_0)(\bar{x}_j-m_0)^T \\ \nu_j &= \nu_0 + N_j \end{aligned} \] In order to perform this variational M-step, we need the expectations \(\mathbb{E}[z_{nj}] = r_{nj}\) representing the responsibilities. This involves the following results : \[ \begin{aligned} \mathbb{E}_{\mu_j,\Lambda_j}[(x_n-\mu_j)^T\Lambda_j(x_n-\mu_j)] &= \frac{d}{\beta_j}+ \nu_j(x_n-m_j)^TW_j(x_n-m_j)\\ \ln \tilde{\Lambda}_j \equiv \mathbb{E}[\ln |\Lambda_j|] &= \sum_{i=1}^d\psi \bigg(\frac{\nu_j+1-i}{2}\bigg) + d\ln 2 + \ln |W_j|\\ \ln \tilde{\rho}_j \equiv \mathbb{E}[\ln \rho_j] &= \psi (\alpha_j) - \psi (\hat{\alpha}) \ \ \textrm{where}\ \ \hat{\alpha} = \sum_{j=1}^k \alpha_j \end{aligned} \] Finally, substituting these expectations, we obtain the following result for the responsibilities : \[ \begin{aligned} r_{nj} &\propto \tilde{\rho}_j\tilde{\Lambda}_j^{1/2} \exp \bigg\{- \frac{d}{2\beta_j} - \frac{\nu_j}{2}(x_n-m_j)^TW_j(x_n-m_j) \bigg\} \\ \end{aligned} \]
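The responsibilities can be computed as follows; this is a minimal sketch assuming the variational parameters are stored as plain R objects (\(\texttt{alpha}\), \(\texttt{beta}\), \(\texttt{nu}\) length-\(k\) vectors, \(\texttt{m}\) a \(k\) × \(d\) matrix, \(\texttt{Winv}\) a \(d\) × \(d\) × \(k\) array; all names are ours) :

```r
# Hedged sketch of the E-step: responsibilities r_nj from the current q() parameters
vbResponsibilities <- function(X, alpha, beta, nu, m, Winv) {
  N <- nrow(X); d <- ncol(X); k <- length(alpha)
  lr <- matrix(0, N, k)
  for (j in 1:k) {
    W <- solve(Winv[, , j])
    lrho <- digamma(alpha[j]) - digamma(sum(alpha))           # E[ln rho_j]
    llam <- sum(digamma((nu[j] + 1 - 1:d) / 2)) + d * log(2) +
      as.numeric(determinant(W, logarithm = TRUE)$modulus)    # E[ln |Lambda_j|]
    xc <- sweep(X, 2, m[j, ])                                 # rows x_n - m_j
    quad <- d / beta[j] + nu[j] * rowSums((xc %*% W) * xc)    # expected quadratic form
    lr[, j] <- lrho + 0.5 * llam - 0.5 * d * log(2 * pi) - 0.5 * quad
  }
  r <- exp(lr - apply(lr, 1, max))   # subtract the row max for numerical stability
  r / rowSums(r)                     # normalise each row to sum to one
}
```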
As a stopping criterion, we use the squared \(L_2\) norm between two successive values of the parameters. While this criterion is greater than some tolerance threshold (e.g. \(10^{-6}\)) and the number of iterations is lower than \(200\), we repeat the following steps :
\(\textbf{E-step}\) : given the dataset \(X_{1:N}\) and the current values of \(\alpha_{1:k}\), \(\beta_{1:k}\), \(\nu_{1:k}\), \(m_{1:k}\) and \(W^{-1}_{1:k}\), compute the responsibilities \(r_{nj}\).
\(\textbf{M-step}\) : given the current responsibilities \(r_{nj}\) and the dataset \(X_{1:N}\), compute \(\alpha_{1:k}\), \(\beta_{1:k}\), \(\nu_{1:k}\), \(m_{1:k}\) and \(W^{-1}_{1:k}\).
\(\textbf{Stopping criterion}\) : compute the criterion from the current parameters \(\alpha_{1:k}\), \(\beta_{1:k}\), \(\nu_{1:k}\), \(m_{1:k}\) and \(W^{-1}_{1:k}\).
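The M-step updates, using the statistics \(N_j\), \(\bar{x}_j\), \(S_j\) defined earlier, can be sketched as follows (the function name and argument layout are ours) :

```r
# Hedged sketch of the variational M-step; Nj: length-k, xbar: k x d, Sj: d x d x k
vbMstepSketch <- function(Nj, xbar, Sj, alpha0, beta0, nu0, m0, W0inv) {
  k <- length(Nj); d <- length(m0)
  alpha <- alpha0 + Nj
  beta  <- beta0 + Nj
  nu    <- nu0 + Nj
  m     <- matrix(0, k, d)
  Winv  <- array(0, dim = c(d, d, k))
  for (j in 1:k) {
    m[j, ] <- (beta0 * m0 + Nj[j] * xbar[j, ]) / beta[j]
    dev <- xbar[j, ] - m0
    Winv[, , j] <- W0inv + Nj[j] * Sj[, , j] +
      (beta0 * Nj[j] / (beta0 + Nj[j])) * tcrossprod(dev)   # dev %*% t(dev)
  }
  list(alpha = alpha, beta = beta, nu = nu, m = m, Winv = Winv)
}
```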
To choose the starting values automatically, we use the function \(\texttt{initPar}\), which runs a simple k-means. Therefore, the initial values of the \(k\) mixture components are already close to the true Gaussian mixture parameters.
We complete the functions \(\texttt{vbMstep}\), \(\texttt{vbEstep}\) and \(\texttt{vbalgo}\).
To run this algorithm, we decide to set \(\texttt{Kfit} = 5\). Let’s show the hyper-parameters of the model :
alpha0 <- 0.1
m0 <- rep(0,d)
beta0 <- 0.1
W0 <- 1*diag(d)
nu0 <- 10
Kfit <- 5
From this last figure, the model seems to detect that there are only \(3\) true mixture components instead of \(5\), as suggested by the \(2\) peaks in the criterion values over the iterations. Let’s show a summary of the Variational Bayes output :
T
## [1] 125
outputvb$Alphamat[,T]
## [1] 151.7898 0.1000 197.4801 0.1000 151.0301
outputvb$Marray[,,T]
## [,1] [,2]
## [1,] -5.887715e-03 1.015828e+00
## [2,] 4.166578e-10 -2.427336e-10
## [3,] 4.779915e-03 1.326847e-02
## [4,] 4.166578e-10 -2.427336e-10
## [5,] 1.030213e+00 -6.743710e-02
To assess the quality of the output, we first compare the ground truth (true mixture density) with the density of the mixture associated with point estimates constructed from the Variational Bayes output. Namely, we choose as a parameter estimate the posterior expectation of this parameter under the variational approximation.
Referring to Bishop (2006), we have :
\(\forall j \in \{1,...,k\}\)
\[ \begin{aligned} \hat{\rho}& = \mathbb{E}\big[\rho|\alpha_{\textrm{VB},1}, ...,\alpha_{\textrm{VB},k}\big] \ \textrm{with}\ \rho \sim \textrm{Dir}(\alpha_{\textrm{VB},1}, ...,\alpha_{\textrm{VB},k}) \\ \hat{\mu}_j &= \mathbb{E}\big[\mu_j|m_{\textrm{VB},j},\beta_{\textrm{VB},j},\Lambda_{\textrm{VB},j}\big] \ \textrm{with}\ \mu_j \sim \mathcal{N}(m_{\textrm{VB},j},(\beta_{\textrm{VB},j}\Lambda_{\textrm{VB},j})^{-1}) \\ \hat{\Lambda}_j &= \mathbb{E}\big[\Lambda_j|W_{\textrm{VB},j},\nu_{\textrm{VB},j}\big] \ \textrm{with}\ \Lambda_j \sim \mathcal{W}((W\textrm{inv}_{\textrm{VB},j})^{-1},\nu_{\textrm{VB},j}) \\ \end{aligned} \]
\(\forall j \in \{1,...,k\}\)
\[ \begin{aligned} \hat{\rho}_j &= \frac{\alpha_{\textrm{VB},j}}{\sum_{i=1}^k \alpha_{\textrm{VB},i} } \\ \hat{\mu}_j &= m_{\textrm{VB},j} \\ \hat{\Lambda}_j &= \nu_{\textrm{VB},j}(W\textrm{inv}_{\textrm{VB},j})^{-1}\\ \hat{\Sigma}_j &= \frac{W\textrm{inv}_{\textrm{VB},j}}{ \nu_{\textrm{VB},j}} \end{aligned} \]
Therefore, we complete the code as follows :
p_vb <- outputvb$Alphamat[,T] / sum(outputvb$Alphamat[,T])
Mu_vb <- outputvb$Marray[,,T]
Sigma_vb <- array(dim=c(d,d,Kfit))
for(j in 1:Kfit){
Sigma_vb[,,j] <- outputvb$Winvarray[,,j,T] / outputvb$Numat[j,T]
}
Let’s show a summary of our point estimates :
p_vb
## [1] 0.3032763623 0.0001998002 0.3945655408 0.0001998002 0.3017584966
Mu_vb
## [,1] [,2]
## [1,] -5.887715e-03 1.015828e+00
## [2,] 4.166578e-10 -2.427336e-10
## [3,] 4.779915e-03 1.326847e-02
## [4,] 4.166578e-10 -2.427336e-10
## [5,] 1.030213e+00 -6.743710e-02
Sigma_vb
## , , 1
##
## [,1] [,2]
## [1,] 0.09825036 -0.03447398
## [2,] -0.03447398 0.04518358
##
## , , 2
##
## [,1] [,2]
## [1,] 1.000000e-01 2.033694e-13
## [2,] 2.033694e-13 1.000000e-01
##
## , , 3
##
## [,1] [,2]
## [1,] 0.04230017 0.03596338
## [2,] 0.03596338 0.14300179
##
## , , 4
##
## [,1] [,2]
## [1,] 1.000000e-01 2.033694e-13
## [2,] 2.033694e-13 1.000000e-01
##
## , , 5
##
## [,1] [,2]
## [1,] 0.09257212 0.03695584
## [2,] 0.03695584 0.08912446
As a confirmation of the last figure, we can see that \(2\) of the estimated mixture components have negligible weights (as well as negligible variances). The model therefore recovered the underlying structure of the data, and converged quickly to the true Gaussian mixture parameters thanks to the accurate k-means initialization. We decide to remove the estimated mixture components with negligible weights (\(\rho_i< 0.001\)). Let’s display the number of estimated components after removing negligible ones :
k_vb
## [1] 3
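The pruning step just described can be sketched as a small helper (the function name and the default threshold are ours) :

```r
# Hedged sketch: drop the mixture components whose estimated weight is below tol
pruneComponents <- function(p, Mu, Sigma, tol = 0.001) {
  keep <- p >= tol
  list(p = p[keep] / sum(p[keep]),            # renormalise the remaining weights
       Mu = Mu[keep, , drop = FALSE],
       Sigma = Sigma[, , keep, drop = FALSE],
       k = sum(keep))                         # number of surviving components
}
```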
Now that we have our Variational Bayes mixture estimate, we can assign each data point to one of the estimated Gaussian components. For each data point, we keep a colored circle representing its true Gaussian component, and we fill each circle according to its estimated Gaussian component.
In this section, we study the influence of the hyper-parameter \(\alpha_0\). It can be interpreted as the effective prior number of observations associated with each component of the mixture. If its value is small, then the posterior distribution will be influenced primarily by the data rather than by the prior. In the previous figure, we had \(\alpha_0 = 0.1\), which is why the posterior distribution was highly influenced by the dataset and automatically granted negligible weights to two of the estimated components. Let’s now set \(\alpha_0 = 0.6\) and keep \(\texttt{Kfit} = 5\).
alpha0 <- 0.6
m0 <- rep(0,d)
beta0 <- 0.1
W0 <- 1*diag(d)
nu0 <- 10
Kfit <- 5
T
## [1] 133
p_vb
## [1] 0.302833199 0.001192858 0.393522103 0.001192858 0.301258982
outputvb$Marray[,,T]
## [,1] [,2]
## [1,] -5.824610e-03 1.015764e+00
## [2,] 2.995995e-06 -1.743520e-06
## [3,] 4.729005e-03 1.314332e-02
## [4,] 2.995995e-06 -1.743520e-06
## [5,] 1.030197e+00 -6.744155e-02
Again, two of the estimated components are granted very small weights, which are about ten times higher than the weights we had with \(\alpha_0 = 0.1\).
k_vb
## [1] 5
Here, all the estimated components are kept, even if two of them are not significant. Let’s check that with \(\alpha_0 = 1.2\), we do not obtain a sparse solution.
alpha0 <- 1.2
m0 <- rep(0,d)
beta0 <- 0.1
W0 <- 1*diag(d)
nu0 <- 10
Kfit <- 5
Higher values of \(\alpha_0\) lead to a higher number of iterations to achieve convergence.
T
## [1] 143
p_vb
## [1] 0.302306126 0.002371593 0.392284661 0.002371593 0.300666026
outputvb$Marray[,,T]
## [,1] [,2]
## [1,] -5.749889e-03 1.015688e+00
## [2,] 1.045052e-05 -6.074111e-06
## [3,] 4.668637e-03 1.299509e-02
## [4,] 1.045052e-05 -6.074111e-06
## [5,] 1.030178e+00 -6.744685e-02
k_vb
## [1] 5
Again, two of the estimated components are granted small weights; however, this does not lead to a sparse solution. The higher \(\alpha_0\) is, the more the posterior distribution is influenced by the prior. With high values, the model takes more time to converge to the true values and is less likely to identify the true number of Gaussian mixture components. Conversely, when \(\alpha_0\) is low, the posterior distribution is more influenced by the dataset and the model automatically grants negligible weights to some estimated mixture components, so that the true number of components is automatically recovered.
In this section, we study the influence of the other hyper-parameters \(m_0\), \(\nu_0\), \(\beta_0\) and \(W_0\). The first thing to say is that, due to the very accurate initialization, small variations of the hyper-parameters will not change the results of the algorithm, as the parameters are already close to the true mixture distribution. However, if we significantly change their values, we will start to see different behaviors depending on the role of each hyper-parameter. Let’s set \(\texttt{Kfit} = 3\).
\(m_0\) is the Gaussian prior mean governing the mean of each Gaussian component. Let’s change the value of \(m_0\) from \([0\ 0]\) to \([30\ 30]\) :
alpha0 <- 0.1
m0 <- rep(30,d)
beta0 <- 0.1
W0 <- 1*diag(d)
nu0 <- 10
Kfit <- k_true
T
## [1] 49
p_vb
## [1] 0.0001998801 0.9996002399 0.0001998801
Mu_vb
## [,1] [,2]
## [1,] 30.0000000 30.0000000
## [2,] 0.3172233 0.2991954
## [3,] 30.0000000 30.0000000
Sigma_vb
## , , 1
##
## [,1] [,2]
## [1,] 1e-01 9e-299
## [2,] 9e-299 1e-01
##
## , , 2
##
## [,1] [,2]
## [1,] 0.46575361 0.07642721
## [2,] 0.07642721 0.49390382
##
## , , 3
##
## [,1] [,2]
## [1,] 1e-01 9e-299
## [2,] 9e-299 1e-01
When looking at the estimated Gaussian components’ means, we observe that one of the estimated components converges over the iterations to a large component covering the entire dataset. Indeed, its mean corresponds approximately to the mean of the dataset. The other estimated components shrink over the iterations and end up around the same position \([30\ \ 30]\) with a large precision. Therefore, those components do not contain any data point. This is because \(m_0\) was initialized far from every value in the dataset. Let’s plot the results :
\(\nu_0\) is a Wishart prior parameter governing the precision of each Gaussian component. Let’s change the value of \(\nu_0\) from \(10\) to \(300\) :
alpha0 <- 0.1
m0 <- rep(0,d)
beta0 <- 0.1
W0 <- 1*diag(d)
nu0 <- 300
Kfit <- k_true
With \(\nu_0 = 300\), the convergence is achieved pretty quickly.
Let’s change \(\nu_0\) from \(300\) to \(1\) :
alpha0 <- 0.1
m0 <- rep(0,d)
beta0 <- 0.1
W0 <- 1*diag(d)
nu0 <- 1
Kfit <- k_true
With \(\nu_0 = 1\), the model takes more time to converge.
When the hyper-parameter \(\nu_0\) is very high, the posterior estimate of \(\Sigma_{1:k}\) has low values and the algorithm converges quickly. However, when \(\nu_0\) is very low, the posterior estimate of \(\Sigma_{1:k}\) has high values and the algorithm takes more time to converge.
\(\beta_0\) is a Gaussian prior parameter governing the variance of the mean of each Gaussian component. Let’s change the value of \(\beta_0\) from \(0.1\) to \(30\) :
alpha0 <- 0.1
m0 <- rep(0,d)
beta0 <- 30
W0 <- 1*diag(d)
nu0 <- 10
Kfit <- k_true
With \(\beta_0 = 30\), the model takes more time to converge.
\(\beta_0\) changes the initialization of the mean of each mixture component, as the variances of these means are inversely proportional to \(\beta_0\). The initial mean values are thus shifted away from their optimal positions (the centers of the three k-means clusters). As a result, the algorithm explores new components, different from the true Gaussian components, which may also explain why it takes longer to converge.
\(W_0\) is a Wishart prior parameter governing the precision of each Gaussian component. Let’s change the value of \(W_0\) from \(\texttt{1*diag(d)}\) to \(\texttt{1000*diag(d)}\) :
alpha0 <- 0.1
m0 <- rep(0,d)
beta0 <- 0.1
W0 <- 1000*diag(d)
nu0 <- 10
Kfit <- k_true
We do not observe any significant impact when changing the hyper-parameter \(W_0\).
To sum up, only significant variations of those hyper-parameters have a real impact on the solutions.
In this section, we will try to estimate the mixture distribution using the Metropolis-Hastings algorithm.
The algorithm works by simulating a Markov chain whose stationary distribution is \(\pi\). We know \(\tilde \pi = \lambda \pi\) but \(\lambda\) is unknown. The algorithm is the following :
We provide a transition kernel with density \(q(\theta^{*}|\theta)\) on \(\Theta\).
We choose a prior to initialize \(\theta^{0}\).
We iterate until we reach the stopping criterion :
We sample \(\theta^{*}\) from \(q(\theta^{*}|\theta^{t})\), which is the proposed value for \(\theta^{t+1}\);
Acceptance probability : we compute
\[ \alpha = \min\bigg( 1,\frac{\tilde \pi (\theta^{*}) q(\theta^{t}|\theta^{*})}{\tilde \pi (\theta^{t}) q(\theta^{*}|\theta^{t})} \bigg) \] Then, with probability \(\alpha\), we accept the proposal and set \(\theta^{t+1} = \theta^{*}\); otherwise we keep \(\theta^{t+1} = \theta^{t}\).
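This accept/reject step is usually carried out on the log scale; a hypothetical helper (the name and the placeholder log-density arguments are ours) :

```r
# Hypothetical single accept/reject step on the log scale;
# lpost_* are log unnormalised posterior values, lq_* are log proposal densities
mhStep <- function(current, proposal, lpost_curr, lpost_prop, lq_move, lq_back) {
  log_alpha <- min(0, lpost_prop + lq_back - lpost_curr - lq_move)
  if (log(runif(1)) < log_alpha) {
    list(state = proposal, accepted = TRUE)    # theta^{t+1} <- theta^*
  } else {
    list(state = current, accepted = FALSE)    # chain stays at theta^t
  }
}
```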
The prior density is implemented in the function \(\texttt{dprior}\). The proposal kernel at current state \(\theta^{t} = (\rho_{1:k}^t,\mu_{1:k}^t,\Sigma_{1:k}^t)\) is \(Q_{\textrm{MH}}(\theta^t,\theta^{*})\), generating a proposal \(\theta^{*} = (\rho_{1:k}^{*},\mu_{1:k}^{*},\Sigma_{1:k}^{*})\) as follows :
\(\rho^{*} \sim \textrm{Dir}(\alpha_p \times \rho^t)\) where \(\alpha_p>0\);
\(\mu_j^{*} \sim \mathcal{N}(\mu_j^t, \sigma_{\mu}^2I_d)\) where \(\sigma_{\mu}^2\) is a variance parameter;
\(\Sigma_j^{*} \sim \mathcal{W}(\nu_{\Sigma}, (\nu_{\Sigma})^{-1}\Sigma_j^t)\) where \(\nu_{\Sigma}\) is a degree-of-freedom parameter fixed by the user. Thus \(\mathbb{E}_{Q_{\textrm{MH}}}(\Sigma_j^{*}|\Sigma_j^t) = \Sigma_j^t\) and the distribution of \(\Sigma_j^{*}\) is more peaked around \(\Sigma_j^t\) for large values of \(\nu_{\Sigma}\).
We complete the function \(\texttt{rproposal}\) as follows :
Mu[j,] <- rmvn(mu=Mu[j,], Sigma=ppar$var_Mu*diag(d))
Sigma[,,j] <- rwishart(nu=ppar$nu_Sigma, S=Sigma[,,j]/ppar$nu_Sigma)
We have : \[ q_{\textrm{MH}}(\theta^t|\theta^{*}) = q_{\textrm{MH}}(\rho_{1:k}^t|\rho_{1:k}^{*}) q_{\textrm{MH}}(\mu_{1:k}^t|\mu_{1:k}^{*}) q_{\textrm{MH}}(\Sigma_{1:k}^t|\Sigma_{1:k}^{*}) \] Therefore
\[ \log q_{\textrm{MH}}(\theta^t|\theta^{*}) = \log q_{\textrm{MH}}(\rho_{1:k}^t|\rho_{1:k}^{*}) + \log q_{\textrm{MH}}(\mu_{1:k}^t|\mu_{1:k}^{*}) + \log q_{\textrm{MH}}(\Sigma_{1:k}^t|\Sigma_{1:k}^{*}) \] and \[ \begin{aligned} \log \alpha = \min\big(0,&\log \tilde \pi (\theta^{*}) + \log q_{\textrm{MH}}(\rho_{1:k}^t|\rho_{1:k}^{*}) + \log q_{\textrm{MH}}(\mu_{1:k}^t|\mu_{1:k}^{*}) + \log q_{\textrm{MH}}(\Sigma_{1:k}^t|\Sigma_{1:k}^{*}) \\ - &\log \tilde \pi (\theta^{t}) - \log q_{\textrm{MH}}(\rho_{1:k}^{*}|\rho_{1:k}^t) - \log q_{\textrm{MH}}(\mu_{1:k}^{*}|\mu_{1:k}^t) - \log q_{\textrm{MH}}(\Sigma_{1:k}^{*}|\Sigma_{1:k}^t)\big) \end{aligned} \] However \(\forall j \in \{1,...,k\}\), \(\mu_j^{*} \sim \mathcal{N}(\mu_j^t, \sigma_{\mu}^2I_d)\) and \(\mu_j^{t} \sim \mathcal{N}(\mu_j^{*}, \sigma_{\mu}^2I_d)\), therefore \(q_{\textrm{MH}}(\mu_{j}^{*}|\mu_{j}^t) = q_{\textrm{MH}}(\mu_{j}^t|\mu_{j}^{*})\).
Finally we have : \[ \begin{aligned} \log \alpha = \min\big(0,&\log \tilde \pi (\theta^{*}) + \log q_{\textrm{MH}}(\rho_{1:k}^t|\rho_{1:k}^{*}) + \log q_{\textrm{MH}}(\Sigma_{1:k}^t|\Sigma_{1:k}^{*}) \\ - &\log \tilde \pi (\theta^{t}) - \log q_{\textrm{MH}}(\rho_{1:k}^{*}|\rho_{1:k}^t) - \log q_{\textrm{MH}}(\Sigma_{1:k}^{*}|\Sigma_{1:k}^t)\big) \end{aligned} \]
Therefore, we complete the function \(\texttt{MHsample}\) as follows :
llkmoveRho <- ddirichlet(x=proposal$p, alpha=alphaPropmove, log=TRUE)
llkbackRho <- ddirichlet(x=current$p, alpha=alphaPropback, log=TRUE)
lacceptratio <- min(0, proposal$lpost - current$lpost +
llkbackRho - llkmoveRho +
llkbackSigma - llkmoveSigma)
Let’s look at the hyper-parameter values :
alpha0 <- 0.1
m0 <- rep(0,d)
beta0 <- 0.1
W0 <- 1*diag(d)
nu0 <- 10
nsample <- 3000
Kmc <- 3
hpar <- list( alpha0= alpha0, m0=m0, beta0=beta0, W0=W0, nu0=nu0)
ppar <- list(var_Mu=0.001, nu_Sigma=500, alpha_p=500)
Let’s test our code with \(3000\) iterations :
elapsed
## user system elapsed
## 7.678 0.015 7.695
outputmh$naccept ## should not be ridiculously low.
## [1] 174
Out of \(3000\) iterations, only \(174\) proposed values were accepted.
Let’s take a look at the mixture distribution estimate :
The Metropolis-Hastings algorithm also provides good results for mixture distribution estimation.
Relevant numerical summaries can be constructed, such as the value of the cumulative distribution function (\(\textit{cdf}\)) at a given point \(y = (y_1,...,y_d)\), \[ F(y|\rho_{1:k}^t,\mu_{1:k}^t,\Sigma_{1:k}^t) = \mathbb{P}(X_1\leq y_1,...,X_d\leq y_d|\rho_{1:k}^t,\mu_{1:k}^t,\Sigma_{1:k}^t),\ t\in \{1,...,N_\textrm{sample}\} \] where \(N_\textrm{sample}\) is the number of iterations.
To obtain such a time series, we complete the function \(\texttt{cdfTrace}\) as follows :
gmcdf(x, sample$Mu[,,niter], sample$Sigma[,,,niter], sample$p[,niter])
We notice that the parameters \(\texttt{thin}\) and \(\texttt{burnin}\) allow us, respectively, to keep only one out of every \(\texttt{thin}\) samples and to discard the first \(\texttt{burnin}\) samples.
The Heidelberger and Welch approach calculates a test statistic to accept or reject the null hypothesis that the Markov chain comes from a stationary distribution. It consists of two parts :
It calculates the test statistic on the whole chain. If the null hypothesis is rejected, it discards the first 10% of the chain and calculates the test statistic on the remainder. This step is repeated until the null hypothesis is accepted or 50% of the chain has been discarded. The latter outcome constitutes a “failure” of the stationarity test and indicates that the chain needs to be run longer. If the stationarity test is passed, the number of iterations to keep and the number to discard are reported.
The half-width test calculates a 95% confidence interval for the mean, using the portion of the chain which passed the stationarity test. Half the width of this interval is compared with the estimate of the mean. If the ratio between the half-width and the mean is lower than \(\epsilon\), the half-width test is passed. Otherwise the chain must be run longer.
All of this is done using the function \(\texttt{heidel.diag}\) in \(\texttt{R}\).
If the half-width test fails, the run should be extended; we increase the run length by a factor \(I > 1.5\) each time, so that the estimate is based on the same, reasonably large, proportion of new data.
Therefore, an idea to use \(\texttt{heidel.diag}\) and \(\texttt{cdfTrace}\) to propose a reasonable number of iterations to achieve convergence is :
Starting with \(N_\textrm{sample} = 3000\).
Randomly selecting \(8\) points from the dataset \(X_{1:N}\) and computing the Heidelberger and Welch diagnostic of the \(\textit{cdf}\) trace at each of these \(8\) points.
If both tests pass for every sample, then we divide the number of MCMC iterations by \(1.5\) and repeat the same step.
If one of the tests of one of the samples fails, then we choose the previous number of MCMC iterations.
With this method, we will find a reasonable number of iterations to achieve convergence. However, since the samples are selected randomly, we repeat the algorithm \(3\) times and select the median result as the number of MCMC iterations.
The code is the following :
set.seed(seed)
init <- initPar(x=X, k=Kmc)
set.seed(1)
nb_iterations_ <- c()
for (it in 1:3) {
  # pick 8 data points at which the cdf trace will be monitored
  samples <- sample(1:N, 8)
  nsample <- 3000
  nsample_prev <- nsample
  outputmh <- MHsample(x=X, k=Kmc, nsample=nsample, init=init, hpar=hpar, ppar=ppar)
  pass <- TRUE
  while (pass) {
    for (n in samples) {
      # res[1]: stationarity test passed, res[4]: half-width test passed
      res <- heidel.diag(mcmc(cdfTrace(X[n,], outputmh)))
      if (res[1] != 1 | res[4] != 1) {
        pass <- FALSE
        break
      }
    }
    nsample_prev <- nsample
    nsample <- round(nsample * 2/3)   # divide the run length by 1.5
    outputmh <- MHsample(x=X, k=Kmc, nsample=nsample, init=init, hpar=hpar, ppar=ppar)
  }
  # record the run length reached when a test first failed
  nb_iterations_ <- c(nb_iterations_, nsample_prev)
}
nb_iterations <- median(nb_iterations_)
nb_iterations
## [1] 1333
We find \(1333\) as a reasonable number of iterations to achieve convergence.
We generate three different chains with different starting values. First, let’s look at the plot of each chain :
There are two plots for each \(\textit{cdf}\). The left plot shows the values the \(\textit{cdf}\) took during the runtime of the chain. The right plot is the marginal density plot : it is the smoothed histogram of the values shown in the \(\texttt{cdfTrace}\) plot, i.e. the distribution of the values of the \(\textit{cdf}\) along the chain.
We clearly see the acceptance/rejection pattern in the several constant portions of the curves.
The Gelman and Rubin diagnostic measures whether there is a significant difference between the within-chain variance and the between-chain variance of several chains, through a value called the “scale reduction factor”. We can first look at a combination of the three chains :
We can use \(\texttt{gelman.diag}\) to obtain the scale reduction factor for the \(\textit{cdf}\). A factor of \(1\) means that the between-chain and within-chain variances are equal; larger values mean that there is still a notable difference between chains. We can set a tolerance threshold at \(1.1\) and look at the number of iterations that we get.
gelman.diag(outputmh_list)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1.05 1.15
First, we observe that at the end of our \(3000\) iterations, the scale reduction factor is below \(1.1\).
We can see on the plot from the Gelman and Rubin diagnostic that a reasonable number of iterations for the chain to converge would be around \(1400\), which is close to the value we found with the Heidelberger and Welch test.
The predictive density can be estimated from the Metropolis-Hastings sample by computing the empirical mean of the density over the Metropolis-Hastings output : \[ \hat{f}_{\textrm{MH}}(y) = \frac{1}{M} \sum_{t=1}^Mf(y|\rho_{1:k}^t, \mu_{1:k}^t,\Sigma_{1:k}^t) \] where \(M\) is the number of remaining samples after discarding the burn-in period and thinning.
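This estimator can be sketched as follows; \(\texttt{gmdensity}\) is our own helper for the mixture density \(f(y|\rho,\mu,\Sigma)\) (the project may provide an equivalent), and the layout of \(\texttt{sample}\) follows the one used by \(\texttt{cdfTrace}\) :

```r
# Our own helper: Gaussian mixture density f(y | p, Mu, Sigma) at a point y
gmdensity <- function(y, p, Mu, Sigma) {
  d <- length(y)
  sum(sapply(seq_along(p), function(j) {
    dev <- y - Mu[j, ]
    p[j] * exp(-0.5 * sum(dev * solve(Sigma[, , j], dev))) /
      sqrt((2 * pi)^d * det(Sigma[, , j]))
  }))
}

# Hedged sketch of the predictive density: empirical mean over retained MH draws
mhPredictiveDensity <- function(y, sample, burnin, thin = 1) {
  keepers <- seq(burnin + 1, ncol(sample$p), by = thin)
  mean(sapply(keepers, function(t)
    gmdensity(y, sample$p[, t], sample$Mu[, , t], sample$Sigma[, , , t])))
}
```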
We complete the code in order to plot the numerical approximation together with the true density. We define \(\texttt{outputmh}\) as one of the three previously generated chains with good p-values for the Heidelberger and Welch’s test and we set \(\texttt{burnin} = 1333\) :
The numerical approximation is close to the true density. Moreover, whereas the support of the mixture estimated by Variational Bayes is essentially confined to that of the true distribution, the figure shows that the Metropolis-Hastings estimate is not limited in this way.
In this section we focus on the probability of an excess of a threshold \(u \in \mathbb{R}\).
\[ \phi(u, \rho_{1:k}, \mu_{1:k}, \Sigma_{1:k}) = 1 - F(u|\rho_{1:k}, \mu_{1:k}, \Sigma_{1:k}) = \mathbb{P}(X_1>u \textrm{ or ... or } X_d>u|\rho_{1:k}, \mu_{1:k}, \Sigma_{1:k}) \] where \(F(y|\rho_{1:k}, \mu_{1:k}, \Sigma_{1:k})\) is the \(\textit{cdf}\) for the Gaussian Mixture (\(\texttt{gmcdf}\) function).
The objective is to compare the performance of the estimators obtained by Variational Bayes and Metropolis-Hastings.
The true \(\phi\) is easily computed using \(\texttt{gmcdf}\).
\(\hat{\phi}_2(u) = \mathbb{E}\big[ \phi (u, \rho_{1:k}, \mu_{1:k}, \Sigma_{1:k})\big]\ \textrm{with}\ (\rho_{1:k}, \mu_{1:k}, \Sigma_{1:k}) \sim Q^{*}\) where \(Q^{*}\) is the posterior variational distribution. \(\hat{\phi}_2\) is computed using the function \(\texttt{vbPredictiveCdf}\).
\(\hat{\phi}_3(u) = \frac{1}{M}\sum_{t=1}^M \phi (u, \rho^t_{1:k}, \mu^t_{1:k}, \Sigma^t_{1:k})\) where \(M\) is the number of remaining samples after discarding the burn-in period and thinning. \(\hat{\phi}_3\) is computed using the function \(\texttt{MHpredictiveCdf}\).
First, we complete the function \(\texttt{MHpredictiveCdf}\).
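A plausible completion is sketched below, assuming \(\texttt{gmcdf}\)’s argument order as used earlier in \(\texttt{cdfTrace}\), and the same \(\texttt{burnin}\)/\(\texttt{thin}\) conventions as for the predictive density :

```r
# Hedged sketch of MHpredictiveCdf: phi_3(u) as the empirical mean of the
# exceedance probability 1 - F(u) over the retained MH draws
MHpredictiveCdf <- function(u, sample, burnin, thin = 1) {
  keepers <- seq(burnin + 1, ncol(sample$p), by = thin)
  phi_t <- sapply(keepers, function(t)
    1 - gmcdf(u, sample$Mu[, , t], sample$Sigma[, , , t], sample$p[, t]))
  mean(phi_t)   # average over the retained draws
}
```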
We consider a range of thresholds on the diagonal line, \(u = (x, x), \forall x \in [−1, 4]\). Let’s complete the following code chunk in order to plot on the same graph, as a function of \(x\), \[ \begin{aligned} &\phi((x,x)|\rho_{1:k}, \mu_{1:k}, \Sigma_{1:k})\\ &\hat{\phi}_2(x,x)\\ &\hat{\phi}_3(x,x) \end{aligned} \]
We set \(\texttt{burnin} = 1333\) :
The probability of an excess of a threshold \(u \in \mathbb{R}\) for the Metropolis-Hastings model fits the curve of the true Gaussian mixture distribution almost perfectly. The one for the Variational Bayes model deviates slightly for some thresholds.
Now, we consider the tail of the distribution : we consider a range of thresholds on the diagonal line, \(u = (x, x), \forall x \in [1, 5]\) :
For the tail of the mixture distributions, the results are similar, but \(\hat{\phi}_2\) seems to have lower probabilities for small values of \(u\).
We now focus on \(\hat{\phi}_3\). Let’s plot on the same graph the posterior 90% credible sets obtained with the empirical quantiles of the time series \(\phi((x,x)|\rho_{1:k}^t, \mu_{1:k}^t, \Sigma_{1:k}^t)\), \(t \in \{1,...,M\}\) :
The 90% credible sets are fairly narrow around the probability of an excess.